1 Introduction: Two Types of Neglected Data

Matching-based observational studies in education sciences often neglect data from the
“remnant” of a match: untreated and unmatched subjects. That is, researchers will select a set
of matched controls that most closely resemble the treated subjects, and discard data from
the remnant, the unmatched controls.

Similarly, due to sample size and other modeling limitations, researchers will typically
condition their experimental and observational studies on a small set of pre-treatment
covariates that are deemed most relevant to the study: the variables thought most likely to pose
a confounding threat. In many cases, reams of less relevant data are available, perhaps from
state longitudinal data systems (SLDS) or from other sources. These less relevant covariates
are often discarded.

Conducting a causal analysis using only the matched sample and using only relevant
covariates makes good statistical sense. The data from subjects that are not part of a match
are likely to be distributed differently than data from the match. The process of matching
encourages researchers to focus their analysis on the region of common support; the remnant
is typically outside this region by construction. Including irrelevant variables in an analysis
can swamp the sample, introduce over-fitting or extreme imprecision, and make popular
statistical techniques such as ordinary least squares and logistic regression impossible.

But these excluded data (the remnant and ostensibly irrelevant covariates) may also
contain valuable information. Perhaps the distribution of the outcome conditional on
covariates could be estimated with more precision by vastly increasing the sample size using
discarded subjects. Perhaps discarded covariates are not so irrelevant, and capture important
baseline differences between treated and untreated subjects.

This paper is an attempt to thread this needle with a new method that we call
“remnant-based residualization,” or “rebar.” The idea of rebar is, on the one hand, to extract as much
useful information as possible from the remnant and all available covariates, and on the
other hand to preserve the most attractive properties of a good matching design. To
implement rebar, we fit a machine learning prediction model to the unmatched controls (the
“remnant”), predicting their outcomes in the control condition as a function of the entire set
of covariates. Using this fitted model, we then generate predicted outcomes for the matched
sample. Finally, instead of calculating the effect of the treatment on participants’ outcomes
themselves, we estimate the intervention’s effect on the difference between participants’
actual outcomes and their predicted outcomes under the control condition, i.e. their prediction
residuals; this is “residualization.” The predictive model need not be correct in any sense,
or consistent or unbiased for any particular parameter. It must only yield predictions that
are closer, on average, to control potential outcomes than their mean.

Rebar builds thematically on prior work combining matching with outcome modeling,
such as Rubin (1973) and Ho et al. (2007a), among others, alongside “doubly robust”
estimation (e.g. Kang and Schafer, 2007). Its most direct antecedents are Rosenbaum (2002a)
and Abadie and Imbens (2012), which suggest forms of residualization for matching
estimators, and Middleton and Aronow (2011), which does the same for weighting estimators.
Our contribution to that literature is twofold. First, rebar is remnant-based; we argue here
that residualization is well suited to recovering otherwise lost information from the remnant.
Second, we demonstrate by simulation and example how rebar can exploit machine
learning methods and high-dimensional covariates without compromising the classical statistical
properties of the match.

Rebar can supplement a wide range of matching analyses, and may be used alongside 
other outcome models and covariate adjustments. 

The following section will review causal matching studies, and Section 3 will formally
introduce rebar. There, we will discuss a possible threat to the validity of a matching
design that rebar can introduce: if the distribution of outcomes, conditional on covariates,
differs widely enough between the remnant and the matched set, rebar might increase, rather
than decrease, bias. We will introduce a diagnostic called “proximal validation” that should
detect such pathological cases, and suggest ways to tweak the algorithm if a researcher were
to confront one.

Rebar can potentially reduce both the bias and the variance of causal estimates, by
modeling otherwise unmodeled variation. That said, this paper will focus its attention
on rebar’s bias-reducing properties. We will argue, with analytical results (Section 4), a
simulation study (Section 5), and an empirical example, that rebar is an
effective method for reducing confounding bias from measured, but unmodeled, confounders
in a high-dimensional dataset, without compromising the key advantages of matching.


2 Matching in Observational Studies: Review 

In an observational study, let $i = 1, \dots, n$ index $n$ subjects, and let $Z_i$ denote subject $i$’s
binary treatment assignment, and $Y_i$ subject $i$’s observed outcome of interest. Assuming
non-interference (Cox, 1958), and following Neyman (1990) and Rubin (1974), let $y_{Ti}$ and $y_{Ci}$
denote subject $i$’s (perhaps counterfactual) responses were subject $i$ treated and untreated,
respectively. Then $Y_i = y_{Ti} Z_i + y_{Ci} (1 - Z_i)$. Further, let $x_i$ be a vector of covariates measured
prior to treatment. The potential outcomes $y_C$ and $y_T$ define treatment effects $\tau_i = y_{Ti} - y_{Ci}$
and a causal estimand

$$\tau_{ETT} = E\left[ \frac{\sum_i Z_i \tau_i}{n_T} \right], \tag{1}$$

the expected average effect of the treatment on the treated. The expectation in (1) is taken
conditional on the posited sampling scheme.

In a matching-based observational study, a researcher will create a new categorical
variable, $M$, considering subjects $i$ and $j$ to be “matched” to one another if $M_i = M_j$. (Subjects
$i$ with the property that $M_j \neq M_i$ for all $j \neq i$ are unmatched.) Researchers will choose $M$ in
such a way that matched subjects have similar covariate distributions $x$. Perhaps the most
popular approach to matching is to use propensity scores (Rosenbaum and Rubin, 1983),
$\Pr(Z = 1 \mid x)$, a subject’s probability of being assigned to treatment conditional on her covariates
$x$. In a propensity-score matching design, treated and untreated subjects are grouped into
matches $M$ with approximately equal estimated propensity scores. Other inexact matching
techniques measure subjects’ similarity in $x$ using, for example, Mahalanobis distances
(Rubin, 1980) or covariate balance tests (Diamond and Sekhon, 2013). Matched sets may contain
any (positive) number of treated or untreated subjects (Rosenbaum, 1991).

Ideally, within any matched set, no subject’s a priori probability of making its way into
the treatment group was larger or smaller than any other’s:

$$\Pr(Z_i = 1 \mid M) = \Pr(Z_j = 1 \mid M) \text{ whenever } M_i = M_j; \tag{2}$$

this is perfect matching. Under perfect matching in the sense of (2), matched comparisons are
statistically equivalent to contrasts of treatment and control conditions in block- or
paired-randomized designs (e.g. Braitman and Rosenbaum, 2002; Rubin, 2008; Hansen, 2011).

A simple matching-based estimator compares average treated and untreated outcomes
within each match. The average difference between treated and untreated subjects in
matched set $m$ is

$$t_m(Y, Z) \equiv t(Y_m, Z_m) = \frac{\sum_{i: M_i = m} Y_i Z_i}{n_{Tm}} - \frac{\sum_{i: M_i = m} Y_i (1 - Z_i)}{n_{Cm}},$$

where $Y_m$ and $Z_m$ are the vectors of $Y$ and $Z$ restricted to matched set $m$, and $n_{Tm}$ and
$n_{Cm}$ are the numbers of treated and untreated among subjects $\{i : M_i = m\}$. Then a matching
estimator is

$$\hat\tau_M(Y) = \sum_m w_m t_m(Y, Z), \tag{3}$$

where $w_m$ is the weight given to matched set $m$. Estimator $\hat\tau_M(Y)$ is unbiased for $\tau_{ETT}$ under
perfect matching (2), or, more generally, if the difference in assignment probabilities is uncorrelated with
control potential outcomes (see the lemma in the appendix). In practice neither of these will
be exactly true, but researchers can hope for approximate unbiasedness, and explore their
design’s sensitivity to unmeasured (or unmodeled) bias (e.g. Gastwirth et al., 1998; Hosman
et al., 2010).
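The matching estimator in (3) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `matching_estimate`, the toy data, and in particular the weights $w_m = n_{Tm}/n_T$ (each matched set weighted by its share of the treated subjects) are assumptions made for the example.

```python
from collections import defaultdict

def matching_estimate(y, z, match_id):
    """Weighted average of within-match treated-minus-control mean differences.

    y: outcomes; z: 0/1 treatment indicators; match_id: matched-set label,
    or None for unmatched subjects (who are ignored).
    Assumption: weights w_m = n_Tm / n_T, and every matched set contains
    at least one treated and one untreated subject.
    """
    treated = defaultdict(list)
    control = defaultdict(list)
    for yi, zi, mi in zip(y, z, match_id):
        if mi is None:
            continue  # unmatched subjects do not enter the estimator
        (treated if zi == 1 else control)[mi].append(yi)

    n_t = sum(len(v) for v in treated.values())
    estimate = 0.0
    for m, y_t in treated.items():
        y_c = control[m]
        t_m = sum(y_t) / len(y_t) - sum(y_c) / len(y_c)  # t_m(Y, Z)
        estimate += (len(y_t) / n_t) * t_m                # w_m * t_m(Y, Z)
    return estimate

# Toy data: set "a" is a pair, set "b" has two controls, one control is unmatched.
y = [5.0, 3.0, 7.0, 4.0, 6.0, 1.0]
z = [1, 0, 1, 0, 0, 0]
match_id = ["a", "a", "b", "b", "b", None]
tau_hat = matching_estimate(y, z, match_id)
print(tau_hat)
```

Both matched sets in the toy data have a treated-minus-control difference of 2, so the weighted average is 2 as well; the unmatched control plays no role.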


Frequently, subjects who are not sufficiently similar in x to other units are left unmatched. 
We will refer to the set of unmatched untreated subjects as the “remnant” from a match. 
Typically, the remnant is discarded. While discarding data might seem unwise, there is good 
reason to discard the remnant. Since no suitable comparisons may be found between subjects 
in the remnant and treated subjects, any causal comparisons using the remnant necessarily 
involve modeling $y_C$ as a function of $X$. Moreover, the remnant typically occupies a mostly
separate region of the distribution of $X$ than the matched sample, hence its inability to be
matched. Therefore, comparing outcomes from treated subjects with those from the remnant
involves extrapolation, which can be highly sensitive to model specification. On the other
hand, the remnant may contain information that is useful for modeling $y_C$.

An extensive, occasionally contentious literature discusses variable selection for
propensity score models. This literature begins with Rubin and Thomas, who advised erring on the
side of inclusiveness, striving to exclude only those covariates that a consensus of researchers
believe to be unrelated to each outcome variable (1996, § 2.3); Rosenbaum’s (2002b, p. 76)
view is similar. Later contributions argued that including variables only weakly related to
outcomes may increase the mean squared error (MSE) of effect estimation (Brookhart et al.,
2006; Austin, 2011). These additional losses can in principle take the form of bias, not
only variance, even if the MSE-increasing variable was determined in advance of treatment
assignment (Greenland, 2003; Sjolander, 2009; Pearl, 2009), although case studies suggest
these types of bias are often small (Liu et al., 2012; Ding and Miratrix, 2015). Methods
attempting to limit the MSE penalty by limiting propensity modeling variables to those
that correlate with observed outcomes have been met with criticism of a different nature:
in Rubin’s view, in order to maximize objectivity, researchers should keep outcome
measurements in a virtual locked box during matching, only to emerge once the matching structure
and other study design elements have been determined (Rubin, 2008).

Rebar, the method of this paper, is compatible with either attitude to selection of
propensity score variables; our illustration emphasizes this compatibility by adhering to the
more restrictive of the two schools. Without reference to outcome associations, we selected
for inclusion in the propensity model those variables we felt that a consensus of scholars
would be most likely to deem potential confounders. In this example as in many others,
the number of potential confounders that could be addressed in this way was limited: when
$p > n_T$ or $p > n_C$, the treatment and control samples can ordinarily be separated
by a hyperplane in the space spanned by $X$, with the result that common binary regression
methods fail to fit (Agresti, 2013; Zorn, 2005); in the example of §[^ = 7. This heightens
the need for additional measures for confounder control, such as rebar.

3 Rebar: Using an Outcome Model to Reduce Bias in a Matching Design

The procedure we recommend is the following:

1. Using the full dataset, construct a match $M$, perhaps based on a subset of available
covariates, thereby dividing the sample into a matched sample and a remnant.

2. Using units in the remnant, construct an algorithm $\hat{y}_C(\cdot)$ to predict $y_C$ as a function
of the full matrix $X$.

3. Assess the performance of $\hat{y}_C(\cdot)$ (see Section 3.1).

4. For all subjects $i$ in the matched sample, use $\hat{y}_C(\cdot)$ to predict $y_{Ci}$ as $\hat{y}_{Ci} = \hat{y}_C(x_i)$.

5. Construct prediction errors $e = Y - \hat{y}_C(X)$ for all subjects in the matched sample.

6. Estimate treatment effects in the matched sample, substituting $e$ for $Y$ in the outcome
analysis.
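The six steps above can be sketched end to end on toy data. This is an illustrative sketch under stated assumptions, not the authors' implementation: the match is taken as given (pairs), a one-covariate least-squares line (`fit_line`, a hypothetical helper) stands in for whatever prediction algorithm a researcher would actually train on the remnant, and `pair_estimate` is a simple pair-matching estimator written for the example.

```python
def fit_line(x, y):
    """Least-squares line y ~ a + b*x: a stand-in for any prediction algorithm."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda xi: intercept + slope * xi

def pair_estimate(v, z, match_id):
    """Average treated-minus-control difference in v across matched pairs."""
    diffs = {}
    for vi, zi, mi in zip(v, z, match_id):
        if mi is not None:
            diffs[mi] = diffs.get(mi, 0.0) + (vi if zi == 1 else -vi)
    return sum(diffs.values()) / len(diffs)

# Step 1 (taken as given): pairs 0 and 1 are matched; None marks the remnant.
# Toy truth: y_C = 2 * x, and every treated subject gains exactly 1.
x        = [5.0, 4.0, 7.0, 6.0, 0.0, 1.0, 2.0, 3.0]
y        = [11.0, 8.0, 15.0, 12.0, 0.0, 2.0, 4.0, 6.0]
z        = [1, 0, 1, 0, 0, 0, 0, 0]
match_id = [0, 0, 1, 1, None, None, None, None]

# Step 2: train the prediction algorithm on the remnant only.
remnant = [i for i, m in enumerate(match_id) if m is None]
predict = fit_line([x[i] for i in remnant], [y[i] for i in remnant])

# Step 3 (assessing the algorithm) is omitted in this sketch.
# Steps 4-5: predictions and residuals for matched subjects only.
matched = [i for i, m in enumerate(match_id) if m is not None]
e = [y[i] - predict(x[i]) for i in matched]
z_m = [z[i] for i in matched]
id_m = [match_id[i] for i in matched]

# Step 6: the same estimator, applied to residuals instead of outcomes.
raw = pair_estimate([y[i] for i in matched], z_m, id_m)
rebar = pair_estimate(e, z_m, id_m)
print(raw, rebar)
```

In this toy example each treated subject has a higher covariate value than its matched control, so the raw matching estimate (3.0) overstates the built-in effect of 1.0, while the residualized estimate recovers it because the remnant-trained line predicts $y_C$ exactly.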


As in Rosenbaum (2002a), the model $\hat{y}_C(\cdot)$ relating $X$ and $y_C$ is an algorithmic model, rather
than a statistical model. That is, it does not estimate parameters of a probability
distribution, but rather generates deterministic predictions of $y_C$ when given a vector $x$. Since this
procedure relies on the residuals of a model fit to $Y$, we will refer to it as “residualization.”

The predictions $\hat{y}_C(x)$ bear some similarity to prognostic scores (Hansen, 2008).
Prognostic scores, which are analogous to propensity scores, are statistics that are sufficient for
the relationship between $y_C$ and $x$. They are commonly understood as predictions of $y_C$ as a
function of $x$ (e.g. Pane et al., 2013). In fact, much of the intuition behind prognostic scores
supports our use of $\hat{y}_C(x)$ here, though the prognostic score theory will not play a direct role
in our argument.

Now as above, define residuals

$$e = Y - \hat{y}_C(x).$$

Then we may define “potential residuals”: $e_C = y_C - \hat{y}_C(x)$ and $e_T = y_T - \hat{y}_C(x)$. Analogously
to $Y$, the observed residuals are $e = Z e_T + (1 - Z) e_C$. Crucially,

$$e_{Ti} - e_{Ci} = \tau_i, \tag{4}$$

where as above $\tau_i$ is subject $i$’s treatment effect, $y_{Ti} - y_{Ci}$. To see this, note that $y_C =
\hat{y}_C(X) + e_C$ and $y_T = \hat{y}_C(X) + e_T = \hat{y}_C(X) + e_C + \tau$. The prediction $\hat{y}_C(x)$ is based
only on pre-treatment variables $x$, and not on treatment status $Z$ from subjects in the
matched sample. That being the case, it cannot be affected by treatment status: we would
counterfactually estimate the same $\hat{y}_C(x)$ for alternative realizations of $Z$ in the matched
set. Therefore, we can write $e_{Ti} - e_{Ci} = y_{Ti} - \hat{y}_{Ci} - (y_{Ci} - \hat{y}_{Ci}) = y_{Ti} - y_{Ci} = \tau_i$: the treatment
effect is manifest entirely in the residuals $e_C$ and $e_T$, and not at all in $\hat{y}_C(x)$.

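The prediction errors $e$, then, may replace $Y$ in an outcome analysis. In particular,
replace matched-set-specific treatment-control differences in $Y$, $t_m(Y, Z)$, with differences in
$e$: $t_m(e, Z)$. That is, let

$$t_m(e, Z) = \bar{e}_{m,Z=1} - \bar{e}_{m,Z=0} = \frac{1}{n_{Tm}} \sum_{i: M_i = m} e_i Z_i - \frac{1}{n_{Cm}} \sum_{i: M_i = m} e_i (1 - Z_i),$$

then define

$$\hat\tau_{rebar} = \sum_m w_m t_m(e, Z). \tag{5}$$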


Residualization, then, means revising a matching estimator by replacing outcomes $Y$ with
differences between observed values and $\hat{y}_C(\cdot)$ predictions; it aims to rid the dependent
variable of variation that is not informative about treatment effects. Rosenbaum (2002a)
precedes conventional hypothesis tests with a residualization step, using observations within
the matched sample to fit the prediction model. If one instead trains one’s prediction algorithm,
$\hat{y}_C(\cdot)$, using the remnants of the matching procedure, the method becomes compatible with
common estimation (as well as hypothesis testing) techniques, and may offer larger numbers of
observations for training $\hat{y}_C(\cdot)$. Such remnant-based residualization, briefly “rebar,” is the
topic of this paper.


3.1 Cross Validation and Proximal Validation: Assessing $\hat{y}_C(\cdot)$


Using the remnant to model outcomes as a function of covariates affords the researcher a
great deal of flexibility. Researchers may use data from the remnant (both covariates and
outcomes) to attempt a variety of prediction techniques, and choose the one which performs
best. This is particularly important when the dimension of $X$ is large, so that formulating
statistical models based on theory or first principles is hard or impossible; a variety of methods
must be attempted. A useful tool in this regard is $k$-fold cross-validation (Efron and Gong,
1983), which can estimate the predictive accuracy of a model using data from the
training sample. Cross-validation results may be examined for bias, variance, or other measures
of predictive performance, but Proposition 3 (below) suggests a focus on prediction
mean-squared-error. In the rebar case, cross-validation using data from the remnant can estimate
$MSE_{remnant} = \sum_{i \in remnant} (y_{Ci} - \hat{y}_{Ci})^2 / n_{remnant}$ and
$R^2_{remnant} = 1 - MSE_{remnant} / \hat\sigma^2_{remnant}(y_C)$, where $\hat\sigma^2_{remnant}(y_C)$ is the
sample variance of $y_C$ in the remnant (see the footnote below). These
results can be used both to pick a modeling technique and to pick that technique’s tuning
parameters. After modeling choices have been made, researchers arrive at an estimated
prediction function $\hat{y}_C(\cdot)$ that generates predictions $\hat{y}_C(x)$ as a function of covariates
$x$.

Footnote: In defining $MSE_{remnant}$ and $R^2_{remnant}$ thus, we briefly depart from our convention of conditioning
on potential outcomes and instead treat them as random, drawn from the same superpopulation as the
remnant. $MSE_{remnant}$ and $R^2_{remnant}$ do not play a role in the theoretical development of rebar, but are
useful heuristics in practice.

Cross-validation estimates an algorithm’s predictive performance when applied to new
cases drawn from the same population as the training set. Of course, this is manifestly
not the case for rebar. Subjects in the matched sample are likely to be different from
those in the remnant; a model fit and cross-validated in the remnant may not perform
as well in the matched sample as that validation would suggest. Write $S_M$ to denote the
matched sample, i.e. $\{i : \exists j \neq i \text{ s.t. } M_j = M_i\}$. One expects $MSE_{remnant}$ to be less than
$MSE_M = \sum_{i \in S_M} (y_{Ci} - \hat{y}_{Ci})^2 / |S_M|$ and $R^2_{remnant}$ to be greater than $R^2_M$. This is unfortunate
but far from fatal: the more information a prediction algorithm can learn about the matched
sample from the remnant, the better rebar can reinforce a causal design. Perfection is not
necessary.

One does not expect $MSE_M$ to exceed the variance of $y_C$ in the matched sample, although this can
occur. In such cases rebar could do more harm than good. Even with perfect matching in
the sense of (2), it could diminish efficiency; and if (2) is only approximately true, rebar
could increase bias as well.
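The cross-validated $MSE_{remnant}$ and $R^2_{remnant}$ discussed above can be computed along the following lines. This is a sketch under assumptions: folds are assigned by index (in practice one would shuffle first), `fit_line` is a hypothetical one-covariate stand-in for the prediction algorithm, and $R^2$ is taken as one minus the ratio of cross-validated MSE to the sample variance of the outcome.

```python
def fit_line(x, y):
    """Least-squares line y ~ a + b*x: a stand-in for any prediction algorithm."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda xi: intercept + slope * xi

def cross_validate(x, y, fit, k=5):
    """k-fold cross-validated MSE and R^2 for a prediction algorithm `fit`."""
    n = len(x)
    sq_err = 0.0
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        test = [i for i in range(n) if i % k == fold]
        predict = fit([x[i] for i in train], [y[i] for i in train])
        # Each subject is predicted by a model that never saw its outcome.
        sq_err += sum((y[i] - predict(x[i])) ** 2 for i in test)
    mse = sq_err / n
    mean_y = sum(y) / n
    var_y = sum((yi - mean_y) ** 2 for yi in y) / n
    return mse, 1.0 - mse / var_y

# Hypothetical remnant: one covariate, outcome roughly linear in it.
x_rem = [float(i) for i in range(10)]
y_rem = [2.0 * xi + (0.5 if i % 2 == 0 else -0.5) for i, xi in enumerate(x_rem)]
mse_remnant, r2_remnant = cross_validate(x_rem, y_rem, fit_line, k=5)
print(mse_remnant, r2_remnant)
```

The same function can be called once per candidate algorithm or tuning-parameter value, keeping whichever gives the smallest cross-validated MSE in the remnant.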

Fortunately, simple diagnostic tools can identify such pathological cases. Further, in
many of those cases there are simple modifications to rebar that will improve its performance.
To illustrate a diagnostic that we call “proximal validation,” consider full matching within
calipers of width $c_0$ in terms of a continuous variable or index, such as the propensity score.
All control subjects within $c_0$ of a treated subject are matched, with remaining controls
constituting the remnant. How well does an algorithm $\hat{y}_C(\cdot)$ fit in the remnant perform
in the matched sample? To gauge performance, a researcher will subdivide the
remnant into two groups by using caliper $c_1 > c_0$ to construct a new, larger matched set.
The cases in the remnant that are matched under the more permissive caliper $c_1$ are
“proximal” cases: whether they are matched depends on the choice of caliper. The cases
that remain unmatched even under $c_1$ are “distal” cases, unmatchable under either scheme.
Proximal validation re-fits $\hat{y}_C(\cdot)$ using only data from subjects in the distal remnant,
then examines its performance on the proximal portion of the remnant. If $\hat{y}_C(\cdot)$ performs
poorly when extrapolated from the remnant to the matched set, it likely also performs
poorly when extrapolated from distal cases to proximal cases within the remnant. In other
words, proximal validation is a way to gauge the performance of $\hat{y}_C(\cdot)$ when its results are
extrapolated in a way analogous to a matching design.

Proximal validation is not limited to propensity-score full-matching designs with calipers; 
it may be used with any matching design that involves a quantitative restriction on allowable 
matches. The procedure, in general, will be to slightly relax that restriction, choose a second, 
more expansive match, and use the results to divide the remnant into proximal and distal 
portions. 
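For the caliper case, proximal validation might be sketched as follows. The helper names, the toy propensity scores, and the use of a one-covariate least-squares line as the prediction algorithm are all assumptions made for illustration; the labelling rule follows the description above (controls within $c_0$ of a treated subject are matched, those within $c_1$ but not $c_0$ are proximal, the rest are distal).

```python
def fit_line(x, y):
    """Least-squares line y ~ a + b*x: a stand-in for any prediction algorithm."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda xi: intercept + slope * xi

def split_controls(ps_control, ps_treated, c0, c1):
    """Label each control by its distance to the nearest treated propensity score."""
    labels = []
    for p in ps_control:
        d = min(abs(p - q) for q in ps_treated)
        if d <= c0:
            labels.append("matched")    # inside the original caliper
        elif d <= c1:
            labels.append("proximal")   # remnant, but matched under c1
        else:
            labels.append("distal")     # unmatched under either caliper
    return labels

# Hypothetical propensity scores, covariate, and outcomes for the controls.
ps_treated = [0.60, 0.70]
ps_control = [0.58, 0.45, 0.42, 0.20, 0.15, 0.10, 0.05]
x_control  = [9.0, 7.0, 6.0, 3.0, 2.0, 1.0, 0.0]
y_control  = [18.0, 14.5, 11.5, 6.0, 4.0, 2.0, 0.0]

labels = split_controls(ps_control, ps_treated, c0=0.05, c1=0.20)
distal = [i for i, lab in enumerate(labels) if lab == "distal"]
proximal = [i for i, lab in enumerate(labels) if lab == "proximal"]

# Re-fit on the distal remnant only, then score on the proximal remnant.
predict = fit_line([x_control[i] for i in distal], [y_control[i] for i in distal])
mse_proximal = sum((y_control[i] - predict(x_control[i])) ** 2 for i in proximal) / len(proximal)
print(labels, mse_proximal)
```

Comparing `mse_proximal` with the cross-validated MSE from the full remnant gives a rough sense of how much accuracy is lost when the algorithm is extrapolated toward the matched sample.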

If $\hat{y}_C(\cdot)$’s performance in proximal validation is discernibly worse than its cross-validation
performance, the rebar routine should be modified. Suppose the mechanism selecting
untreated units between the remnant and the matched sample is matching based on an
estimated propensity score. In this case, the estimated propensity score itself can be incorporated
into the prediction model $\hat{y}_C(\cdot)$, for instance by including interaction terms between the
columns of $X$ and the estimated propensity score.

Another useful diagnostic test is to check covariate balance on the predictions $\hat{y}_C(X)$.
Since $\hat{y}_C(X)$ is a covariate, a successful matching design will ensure that its distributions are
similar among treated and matched untreated subjects. Even though $\hat{y}_C(X)$ is a constructed
variable, because the model behind it is fit without reference to the matched sample, balance
on it can be tested in the same ways balance on manifest variables can be tested. If a balance
test rejects the hypothesis of $\hat{y}_C(X)$ balance, researchers may revise either the prediction
algorithm $\hat{y}_C(\cdot)$, the matching scheme, or both.

4 Rebar’s Effects on Bias


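To see the potential of rebar to reduce the bias of a matching estimator, note that the rebar
estimator $\hat\tau_{rebar}$ can be expressed as the difference in two estimated treatment effects:

$$\hat\tau_{rebar} = \hat\tau_M(Y) - \hat\tau_M(\hat{y}_C), \tag{6}$$

the matching estimator of the effect of the treatment on $Y$, minus an estimate of the effect
of the treatment on $\hat{y}_C(X)$. To see this, note that:

$$\begin{aligned}
t_m(e, Z) &= \frac{1}{n_{Tm}} \sum_{i: M_i = m} e_i Z_i - \frac{1}{n_{Cm}} \sum_{i: M_i = m} e_i (1 - Z_i) \\
&= \left( \frac{1}{n_{Tm}} \sum_{i: M_i = m} Y_i Z_i - \frac{1}{n_{Cm}} \sum_{i: M_i = m} Y_i (1 - Z_i) \right)
 - \left( \frac{1}{n_{Tm}} \sum_{i: M_i = m} \hat{y}_{Ci} Z_i - \frac{1}{n_{Cm}} \sum_{i: M_i = m} \hat{y}_{Ci} (1 - Z_i) \right) \\
&= t_m(Y, Z) - t_m(\hat{y}_C, Z).
\end{aligned}$$

The expression in (6) follows by taking weighted averages of $t_m(Y, Z)$ and $t_m(\hat{y}_C, Z)$ across
matched sets. Of course, the
treatment cannot have an effect on $\hat{y}_C(X)$, which is a function of pre-treatment covariates
and a separate sample; any observed “effect” of the treatment on $\hat{y}_C(X)$ must be the result
of covariate imbalance.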
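The decomposition in (6) is an algebraic identity, which a small numerical check makes concrete. The sketch below assumes a pair-matched design with equal weights on pairs; `pair_estimate`, the outcomes, and the (deliberately imperfect) predictions `y_hat` are hypothetical.

```python
def pair_estimate(v, z, match_id):
    """Average treated-minus-control difference in v across matched pairs."""
    diffs = {}
    for vi, zi, mi in zip(v, z, match_id):
        diffs[mi] = diffs.get(mi, 0.0) + (vi if zi == 1 else -vi)
    return sum(diffs.values()) / len(diffs)

y        = [11.0, 8.0, 15.0, 12.0]   # observed outcomes in two matched pairs
y_hat    = [9.0, 8.5, 14.0, 11.0]    # predictions from some remnant-trained algorithm
z        = [1, 0, 1, 0]
match_id = [0, 0, 1, 1]

e = [yi - hi for yi, hi in zip(y, y_hat)]       # prediction residuals

lhs = pair_estimate(e, z, match_id)             # estimator applied to residuals
rhs = pair_estimate(y, z, match_id) - pair_estimate(y_hat, z, match_id)
print(lhs, rhs)
```

Here the estimated “effect” of treatment on the predictions is 1.75, which reflects imbalance in whatever covariates drive `y_hat`; subtracting it from the raw estimate of 3.0 gives the residual-based estimate of 1.25 on both sides of the identity.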

Two properties of the rebar estimate follow immediately. First,

Proposition 1.

$$\mathrm{bias}(\hat\tau_{rebar}) = \mathrm{bias}(\hat\tau_M(Y)) - E\,\hat\tau_M(\hat{y}_C)$$

If we consider $\hat\tau_M(\hat{y}_C)$ to be an estimate of the matching estimator’s bias, then the effect
of residualization is to subtract from the matching estimator an estimate of its bias.

Next,

Proposition 2. Under perfect matching $\hat\tau_{rebar}$ is unbiased for $\tau_{ETT}$.

This follows since, when treatment is essentially randomized within matches, $E\,\hat\tau_M(Y) =
\tau_{ETT}$ and $E\,\hat\tau_M(\hat{y}_C) = 0$. So in a successful matching design, rebar does not add bias.

4.1 An Upper Bound on the Rebar Estimate of $\tau_{ETT}$

The closer, on average, predictions $\hat{y}_C(x)$ are to control potential outcomes in the matched
set, the smaller the bias of $\hat\tau_{rebar}$ must be.

Proposition 3. In a matching design, the squared bias of $\hat\tau_{rebar}$ can be bounded as

$$\mathrm{bias}(\hat\tau_{rebar})^2 \leq MSE_M \times C(n, n_T, n_C),$$

where $MSE_M = \sum_{i \in matched} (y_{Ci} - \hat{y}_{Ci})^2 / n_M$, $n_M$ is the number of subjects in the matched
set, and $C(n, n_T, n_C)$ is a constant determined by the matched-set sizes $n_{Cm}$ and $n_{Tm}$ and by $n_T$.
Equivalently,

$$\left( \frac{\mathrm{bias}(\hat\tau_{rebar})}{SD(y_C)} \right)^2 \leq (1 - R^2_M) \times C(n, n_T, n_C),$$

where $SD(y_C)$ is the sample standard deviation of $y_C$ in the matched set and $R^2_M$ is the
prediction $R^2$ in the matched set,
$1 - \sum_{i \in matched} (y_{Ci} - \hat{y}_{Ci})^2 / \sum_{i \in matched} (y_{Ci} - \bar{y}_{C,matched})^2$.
(The proof can be found in the Appendix.)

Remark 1. In a pair-matching design $C(n, n_T, n_C) = 4$.

Therefore, the bias of $\hat\tau_{rebar}$ can be bounded as a function of the average squared error
of the prediction algorithm in the matched set. Were it possible to perfectly predict all
subjects’ $y_C$ values, their treatment effects could be estimated unbiasedly (exactly, in fact).
More broadly, Proposition 3 suggests that prediction algorithms need not be based on a
correct model to yield estimates with low bias. They must merely be accurate, on average.
This, in turn, suggests that machine learning algorithms, whose central purpose tends to be
prediction, can serve well as residualization mechanisms.

In practice, the bounds in Proposition 3 are unobservable, since they involve control
potential outcomes in the matched set, which are only observable for the matched controls.
Further, since the prediction algorithm $\hat{y}_C(\cdot)$ is fit in the remnant, the bounds are not
directly estimable without strong assumptions. But based on cross-validation estimates of
$MSE_{remnant}$ and $R^2_{remnant}$, and an assessment of sensitivity to extrapolation from
proximal validation, researchers can formulate reasonable guesses as to the values of $MSE_M$
and $R^2_M$.

Proposition 3 assumed nothing about subjects’ respective probabilities of treatment
assignment within matches. In particular, it allowed for a situation in which some subjects
may be assigned to treatment with probability 1; this is a rather extreme violation of the
stratified randomization assumption (2). Under weak assumptions about the distribution
of treatment assignments, the bound in Proposition 3 may be considerably tightened. For
instance, Rosenbaum (2002b) suggests a general model for sensitivity analysis for
observational studies: the assumption that for some $\Gamma > 1$, if $M_i = M_j$ (that is, $i$ and $j$ are in the
same matched set) and $P_i = \Pr(Z_i = 1)$ and $P_j = \Pr(Z_j = 1)$, then

$$\frac{1}{\Gamma} \leq \frac{P_i (1 - P_j)}{P_j (1 - P_i)} \leq \Gamma. \tag{7}$$

That is, for matched subjects $i$ and $j$, the ratio of the odds that $i$ is selected for treatment
to the odds that $j$ is selected is bounded by $1/\Gamma$ and $\Gamma$. The following Proposition uses the
framework in (7) to tighten the bound in Proposition 3 in the simple case of a matched-pair
design; an analogous result may hold for more complex designs, but we leave such an
extension for future work.


Proposition 4. In a pair-matching design, if (7) holds for some $\Gamma > 1$, then

$$\mathrm{bias}(\hat\tau_{rebar})^2 \leq MSE_M \times 4\,\frac{\Gamma^{1/2} - 1}{\Gamma^{1/2} + 1}.$$

Equivalently,

$$\left( \frac{\mathrm{bias}(\hat\tau_{rebar})}{SD(y_C)} \right)^2 \leq (1 - R^2_M) \times 4\,\frac{\Gamma^{1/2} - 1}{\Gamma^{1/2} + 1}.$$

(The proof may be found in the Appendix.)

Remark 2. For $\Gamma = 6$, which Rosenbaum (2002b, p. 114) characterized as “a high degree
of insensitivity to hidden bias,” $4\,\frac{6^{1/2} - 1}{6^{1/2} + 1} \approx 1.7$. That is, a very weak assumption about the
balance of treatment assignment probabilities in a matched-pair design constricts the bound
in Proposition 3 by more than half. If $\Gamma = 3$, the multiplier on $(1 - R^2_M)$ is approximately
one. On the other hand, as $\Gamma \to \infty$, the multiplier approaches 4, as in Remark 1.
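The arithmetic in Remark 2 is easy to reproduce. The snippet below reads the multiplier in Proposition 4 as $4(\Gamma^{1/2} - 1)/(\Gamma^{1/2} + 1)$, the form that matches the values quoted in the remark; the function name `gamma_multiplier` is ours.

```python
import math

def gamma_multiplier(gamma):
    """Multiplier 4 * (sqrt(gamma) - 1) / (sqrt(gamma) + 1) on MSE_M or (1 - R^2_M)."""
    root = math.sqrt(gamma)
    return 4.0 * (root - 1.0) / (root + 1.0)

for gamma in (1, 3, 6, 100, 1e6):
    print(gamma, round(gamma_multiplier(gamma), 3))
```

At $\Gamma = 1$ (perfect matching) the multiplier is zero, at $\Gamma = 3$ it is about 1.07, at $\Gamma = 6$ about 1.68, and it climbs toward the pair-matching constant of 4 as $\Gamma$ grows.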


Propositions 3 and 4 show that by using data from the remnant and covariate matrix $X$
to predict potential outcomes $y_C$, researchers can substantially bound the bias of their
treatment effect estimates. The closer the estimates are to the true values, on average, the
lower the bound on the bias: the algorithm $\hat{y}_C(\cdot)$ need not be correct in any sense, only
predictive.


5 A Simulation Study 

Section 4 gave a theoretical argument for how rebar will remove bias in a pair-matching
design, including an upper bound for the bias of a rebar estimate as a function of the mean
squared error of prediction algorithm $\hat{y}_C(\cdot)$. However, previous sections did not address the
extent of the bias reduction due to rebar, or how rebar might supplement other cutting-edge
matching algorithms.

To address these questions, we ran a simulation study with $n = 400$ “subjects” and $p = 600$
covariates. The study imagined a researcher who knows that five of the 600 covariates (the
first five columns of covariate matrix $X$) predict both $y_C$ and $Z$, and constructs a match
based on those five. The goal of the study is to determine the value of reinforcing that match
with rebar, using an algorithm fit to all 600 covariates, under a variety of circumstances.

5.1 Data Generating Models 

The outcomes $y_C$ were generated as a linear function of a multivariate normal vector $x$:

$$y_{Ci} = \mathbf{1}^T x_{i,1:5} + \beta^T x_{i,6:600} + \epsilon_i, \tag{8}$$

where the coefficients $\beta$ are drawn from an exponential distribution with a rate of $\lambda = 5$ and
$\epsilon$ is drawn from a standard normal distribution. A “treated” group was selected according
to probabilities

$$\Pr(Z_i = 1 \mid x_i) = \mathrm{logit}^{-1}\left( \alpha^* + \mathbf{1}^T x_{i,1:5} + \kappa \beta^T x_{i,6:600} \right). \tag{9}$$

That is, the log odds of treatment assignment were linear in covariates. We chose the
parameter $\alpha^*$ in such a way that, on average, $n_T = 50$ subjects are treated. As in (8), the coefficients
for the first five columns of $X$ in (9) were all set equal to 1. The coefficients of the other
595 ($= p - 5$) columns in (9) were the same as in (8), multiplied by a factor $\kappa$, which varied
between simulation runs.
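A data-generating process of this shape can be sketched in standard-library Python. Several details here are assumptions for illustration and not taken from the simulation code: the covariates are drawn as independent standard normals (an identity covariance), the intercept `alpha` is found by bisection so that the expected number treated in the generated sample equals the target, and the function name, defaults, and seed are ours.

```python
import math
import random

def simulate(n=400, p=600, kappa=0.1, n_treated_target=50, seed=1):
    """Draw (X, y_c, z, probs) with the linear / logistic structure of (8) and (9)."""
    rng = random.Random(seed)
    beta = [rng.expovariate(5.0) for _ in range(p - 5)]       # rate lambda = 5
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    first5 = [sum(row[:5]) for row in X]                       # coefficients equal to 1
    rest = [sum(b * v for b, v in zip(beta, row[5:])) for row in X]
    y_c = [f + r + rng.gauss(0.0, 1.0) for f, r in zip(first5, rest)]
    linear = [f + kappa * r for f, r in zip(first5, rest)]     # log odds minus intercept

    def expected_treated(alpha):
        return sum(1.0 / (1.0 + math.exp(-(alpha + v))) for v in linear)

    # Assumption: choose the intercept by bisection so that the expected
    # number of treated subjects matches the target.
    lo, hi = -50.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if expected_treated(mid) < n_treated_target:
            lo = mid
        else:
            hi = mid
    alpha = (lo + hi) / 2.0

    probs = [1.0 / (1.0 + math.exp(-(alpha + v))) for v in linear]
    z = [1 if rng.random() < pr else 0 for pr in probs]
    return X, y_c, z, probs

X, y_c, z, probs = simulate()
print(sum(z), round(sum(probs), 3))
```

Setting `kappa=0` leaves only the first five columns in the treatment model, while larger values let the remaining 595 columns drive both treatment and outcome.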

The factor $\kappa$ controls the amount of confounding after matching. When $\kappa = 0$, only the
first five columns of $X$, the matching covariates, predict $Z$, so estimates from the match
should be approximately unconfounded. When $\kappa > 0$, every column of $X$ predicts both $Z$
and $y_C$, and therefore confounds matching estimators that use only the first five columns of
$X$. As $\kappa$ increases, so does the magnitude of the bias due to confounding after the match; the
three values we assigned $\kappa$ (0, 0.1, 0.5) roughly correspond to no unmatched confounding,
low unmatched confounding, and high unmatched confounding.


A second parameter, $\rho$, controlled the covariance structure of $X$, effectively controlling
the ease of predicting $y_C$ as a function of $X$. In this simulation, $\rho = 0$, 0.004, and 0.05. The
rows of $X$ were generated from a $p = 600$-dimensional multivariate normal distribution, with
a random covariance matrix whose eigenvalues we specified (it was generated with R code of
Varadhan, 2008). We set these eigenvalues $ev_k$, $k = 1, \dots, 600$, to decay exponentially:
$ev_k = \exp\{-\rho k\}$. When $\rho = 0$, all eigenvalues are unity, and the columns of $X$ are uncorrelated.
As $\rho$ increases, the columns of $X$ become increasingly correlated: there is low-dimensional
structure in $X$. Prediction algorithms typically perform better when high-dimensional $X$
can be summarized with a low-dimensional structure. During the simulation we recorded
the estimated prediction $R^2$ from the cross-validation, and models fit to $X$ with higher $\rho$ fit
substantially better.
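One way to see how $\rho$ creates low-dimensional structure is to compute how much of the total variance the leading eigenvalues carry. The summary below (share of the eigenvalue sum held by the first 50 of 600 eigenvalues) is our own illustrative choice, not a quantity reported in the simulation.

```python
import math

def eigenvalues(rho, p=600):
    """Eigenvalues ev_k = exp(-rho * k) for k = 1, ..., p."""
    return [math.exp(-rho * k) for k in range(1, p + 1)]

def share_top(rho, top=50, p=600):
    """Fraction of the eigenvalue sum carried by the `top` largest eigenvalues."""
    ev = eigenvalues(rho, p)
    return sum(ev[:top]) / sum(ev)

for rho in (0.0, 0.004, 0.05):
    print(rho, round(share_top(rho), 3))
```

With $\rho = 0$ the first 50 directions hold only 50/600 of the variance (about 8%); at $\rho = 0.004$ the share is roughly 20%, and at $\rho = 0.05$ it is above 90%, so a few directions summarize most of $X$.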


5.2 Treatment-Effect Estimators 

Each round of the simulation began by constructing three matches: an optimal propensity
score pair match (PSM), a propensity-score nearest-neighbor match (NN), and a coarsened
exact match (CEM). Relative to each of these we recorded an ordinary matching estimate
(3), based on $(z, y, m)$ information from the matched sample, as well as a rebar estimate (5)
that also used $x$, and observations from the remnant. For NN and CEM we additionally
calculated a “bias adjusted” effect estimate that uses the matched sample to model the
relationship between $x$ and the outcome, and we incorporated this form of adjustment into
the rebar estimate as well, using the matched sample to model the relationship of $x$ to rebar
residuals, $e = Y - \hat{y}_C(x)$, as opposed to ordinary outcomes $Y$.

We estimated propensity scores using logistic regression, with $Z$ regressed on the matching
covariates, the first five columns of $X$. For the pair match PSM, we used the pairmatch
routine from the optmatch package in R (R Development Core Team, 2011; Hansen, 2007) to
construct an optimal pair match without replacement: each treated subject was matched to
a unique control subject in such a way that the total distance in propensity scores between
matched subjects was minimized. (Pair matching was chosen for ease of interpretation, not
because it is the best or most easily generated without-replacement matching structure; for
instance, our application uses optmatch to pair each treatment group member to 1-4
controls.) Then, the matching estimator was (3), the average difference in $Y$ between treated
subjects and their matched controls.


The “nearest-neighbor” routine proposed by Abadie and Imbens (2006), and implemented
by Sekhon (2011), matches each treated subject to the untreated subject with the most
similar propensity score, allowing some untreated subjects to be matched to multiple treated
subjects. The NN matching estimator was the “ATT” estimator of Abadie and Imbens
(2006): the average of the differences between each treated subject’s outcome and the
average outcome of its matched controls. Note that, because NN matching matches subjects
with replacement, a control subject’s outcome may appear multiple times in the matching
estimator.
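The with-replacement estimator just described can be sketched in a few lines. This is a minimal illustration, not the Matching package's implementation: the function name `nn_att` is hypothetical, each treated subject is assumed to receive exactly one nearest control, and ties are broken by list order.

```python
def nn_att(y, z, score):
    """ATT from 1-nearest-neighbor matching on a scalar score, with replacement.

    Hypothetical helper for illustration; assumes one match per treated unit.
    """
    treated = [i for i, zi in enumerate(z) if zi == 1]
    controls = [i for i, zi in enumerate(z) if zi == 0]
    diffs = []
    for i in treated:
        # Closest control on the score; the same control may be reused.
        j = min(controls, key=lambda c: abs(score[c] - score[i]))
        diffs.append(y[i] - y[j])
    return sum(diffs) / len(diffs)


y = [5.0, 6.0, 1.0, 2.0, 9.0]
z = [1, 1, 0, 0, 0]
score = [0.60, 0.62, 0.58, 0.20, 0.10]
print(nn_att(y, z, score))
```

In this toy example both treated subjects are matched to the same control (index 2), so its outcome enters the estimator twice.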


Abadie and Imbens (2012) suggest adjusting nearest-neighbor matching with an
outcome model: ordinary least squares (OLS) regression fit to the matched sample. (Abadie
and Imbens (2012) in fact suggest a more complicated regression routine that includes
non-linear terms and interactions as the sample size grows, but in practice implement the
routine with OLS; the Matching package similarly uses OLS.) Since OLS cannot be fit when
the number of covariates exceeds the sample size, we used only the matching covariates for
the bias adjustment.

CEM (Iacus et al., 2011) “coarsens” each continuous matching variable by recoding it as
discrete, with a pre-set number of bins, and then matches exactly within those bins. We
implemented CEM with the cem package in R (Iacus et al., 2015) with five bins and
estimated matched treatment effects with the matching estimator. Ho et al. (2007b) suggest
estimating parametric models for treatment effects using only data from the matched sample.
To implement this adjustment, we regressed Y on the matching covariates and Z in the
matched sample, and recorded the coefficient on Z.
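The coarsening step can be illustrated with a short sketch. It assumes equal-width bins over each covariate's observed range, which is one simple binning rule and not necessarily the rule used by the cem package; `coarsen` and `cem_strata` are hypothetical names.

```python
from collections import defaultdict


def coarsen(value, lo, hi, n_bins=5):
    """Map a continuous value to one of n_bins equal-width bins on [lo, hi]."""
    if hi == lo:
        return 0
    k = int((value - lo) / (hi - lo) * n_bins)
    return min(k, n_bins - 1)  # the maximum value falls in the last bin


def cem_strata(X, z, n_bins=5):
    """Group row indices by coarsened covariate signature.

    Only strata containing both treated (z == 1) and control (z == 0)
    subjects are kept; everything else is left unmatched.
    """
    cols = list(zip(*X))
    ranges = [(min(c), max(c)) for c in cols]
    strata = defaultdict(list)
    for i, row in enumerate(X):
        key = tuple(coarsen(v, lo, hi, n_bins) for v, (lo, hi) in zip(row, ranges))
        strata[key].append(i)
    return {k: idx for k, idx in strata.items()
            if any(z[i] == 1 for i in idx) and any(z[i] == 0 for i in idx)}


X = [[0.0], [0.1], [0.5], [0.9], [1.0]]
z = [1, 0, 1, 0, 0]
print(cem_strata(X, z))
```

Here the treated subject at 0.5 shares its bin with no control and is dropped, and the two controls in the top bin form part of the remnant.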


Each matching estimator left a remnant: control subjects that were not sufficiently
comparable to the treated subjects to be included in a match. We used each of these three
remnants to predict control potential outcomes in the matched samples. In most applications,
as in the example we present below, researchers will use cross-validation to choose
between, or combine, several prediction algorithms. However, for the sake of simplicity, we
used only the LASSO (Tibshirani, 1996) and random forests (Breiman, 2001), implemented
in R with the glmnet and randomForest packages (Friedman et al., 2010; Liaw and Wiener,
2002), tuned and combined with the SuperLearner package (Polley and van der Laan, 2014)
to minimize mean-squared-error. Then, in each simulation run, we recorded three rebar
estimates, one for each match. The PSM rebar estimator was as described earlier. The
NN rebar estimator used both within-sample bias adjustment and rebar: the LASSO was
fit in the remnant, yielding residuals e for the matched sample; then the bias-adjusted NN
estimator was re-computed substituting e for Y. Similarly, the CEM rebar estimate was the
coefficient on Z from a regression of e on Z and the matching covariates fit within the matched
sample. (In contrast to NN and CEM, PSM was not combined with within-matched-sample
covariate adjustments; this helped to limit the number of simulation settings.)
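The residualizing step for a pair match can be sketched as follows. To keep the example self-contained, a one-covariate least-squares line stands in for the LASSO and random forest predictions used in the simulation; `fit_line` and `rebar_pair_estimate` are hypothetical names, and the data are invented.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a single covariate."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - slope * mx, slope


def rebar_pair_estimate(pairs, x, y, remnant):
    """Pair-match estimate computed on residuals rather than raw outcomes.

    pairs:   list of (treated_index, control_index)
    remnant: indices of unmatched controls, used only to fit the predictor
    """
    a, b = fit_line([x[i] for i in remnant], [y[i] for i in remnant])
    e = [yi - (a + b * xi) for xi, yi in zip(x, y)]  # e = Y - yhat_c(x)
    return sum(e[t] - e[c] for t, c in pairs) / len(pairs)


# Outcomes follow y = 2x with no treatment effect; the matched pair is
# imbalanced on x, so the raw difference is 2 while the residual difference is 0.
x = [0.0, 1.0, 2.0, 3.0, 5.0, 4.0]
y = [0.0, 2.0, 4.0, 6.0, 10.0, 8.0]
print(rebar_pair_estimate([(4, 5)], x, y, remnant=[0, 1, 2, 3]))
```

The treated subject is never used to fit the prediction model, which is what lets the residual difference inherit the design-based interpretation of the original match.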


5.3 Simulation Results

Figure 1: Boxplots of treatment effect estimates from 1000 simulation runs under the data
generating models in Section 5.1. The true treatment effect, indicated by a horizontal dotted
line, is zero. The matching methods, optimal pair matching (PSM), nearest-neighbor matching
(NN), and coarsened exact matching (CEM), were either unadjusted (white) or adjusted with
rebar alone (red), with within-sample adjustment (blue) or both (green). The nine simulation
scenarios are arranged in a matrix, with rows for $\kappa = 0$, 0.1, and 0.5, denoting levels of
confounding after the match, and columns for $\rho = 0$, 0.004, and 0.05, from left to right,
denoting correlation structure between covariates. The $R^2_{remnant}$ values listed are averages
of prediction $R^2$ for $\hat{y}_c(\cdot)$ estimated using cross-validation within the remnant.

Figure 1 shows the results of the simulation, after 1000 simulation runs. Each row of
Figure 1 corresponds to a value of $\kappa$: in the first row, $\kappa = 0$, corresponding to
approximately no confounding from the covariates not used in the match; in the second row,
0.1, corresponding to moderate confounding from the left-out covariates; and in the third row,
0.5, corresponding to a high degree of confounding. Each column of Figure 1 corresponds to a
different value of $\rho$: 0, 0.004, and 0.05. These correspond to datasets increasingly amenable
to prediction algorithms; the top of the figure lists the average cross-validation
$R^2_{remnant}$ of $\hat{y}_c(\cdot)$ fit in the remnant from PSM in the $\kappa = 0$ case ($R^2$ values
for other models and other values of $\kappa$ were similar).

Each panel of Figure 1 displays boxplots of nine treatment effect estimates: three matching
estimates in white, from optimal pair (PSM), nearest-neighbor (NN), and coarsened exact
matching (CEM). NN and CEM are shown alongside estimates incorporating within-sample
bias adjustments, shown in blue. Rebar estimates are shown for each matching design: for
PSM, rebar is the only outcome model adjustment, and it is shown in red; for NN and CEM,
rebar and within-sample adjustments are combined and shown in green.

A number of patterns are apparent. When $\kappa = 0$ and the covariates not used in the
match do not pose a confounding threat, all the estimators (with the slight exception of
CEM) are unbiased. Both within-sample bias reduction and rebar reduce the variance of
the effect estimates, subtly for the first two columns and dramatically in the third. As
$\kappa$, or confounding from the non-matching covariates, increases, all effect estimates become
increasingly biased. However, rebar substantially reduces the bias in all cases. Rebar is
similarly effective when used on its own and when used in conjunction with within-sample
outcome model adjustments; that is, rebar has quite a bit to add even after other
adjustments. Unsurprisingly, rebar’s performance, both in terms of bias and variance reduction,
improves with higher $R^2_{remnant}$: the closer, on average, the predictions $\hat{y}_c(x)$ are to
$y_c$ in the remnant (and, presumably, in the matched set, too), the more good rebar can do.

This simulation study shows rebar’s potential: in at least some scenarios, rebar can
substantially reduce both the bias and the variance of a matching estimator.


5.4 Rebar’s Performance Under Non-Linearity 

We conducted a parallel simulation study to investigate rebar’s performance when the
distribution of $y_c$, conditional on X, differs greatly between the remnant and the matched set.
Since it is the match that determines which subjects are in the matched set and which are in
the remnant, and the data generation occurs prior to the match, we could not determine the
distribution of $y_c$ in the remnant exactly. Instead, we let the data generating model for $y_c$
vary with Pr(Z = 1), subjects’ probabilities of being treated. To do so, we modified both
the outcome model and the treatment model. To select treated subjects, we chose
those $2n_T$ with the highest linear predictors in the treatment model, and assigned half
to treatment. That left an “untreatable” group of subjects with Pr(Z = 1) = 0. For the
untreatable subjects, $y_c$ was generated as in the original simulation. For the $2n_T$ subjects
with Pr(Z = 1) = 0.5,


the outcomes were generated as x(3* — x(3* -\- e, where (3* is the concatenation of a vector of 
five 1s with $\beta$. Finally, we transformed $y_c$ to $-y_c$, so that the omitted variable bias would
be positive, as in Section 5.3. In this study, the relationship between x and $y_c$ for subjects
who could be treated was precisely the opposite of the relationship for subjects who could
not. The worry here is that $\hat{y}_c(\cdot)$ will be severely misleading if it is fit in the remnant
and extrapolated to the matched set.

The simulation results suggest that this is, indeed, a concern in some cases. Figure 2
shows the results of rebar adjustment to optimal PSM using two different rebar algorithms
$\hat{y}_c(\cdot)$: LASSO, which depends on a linear model, and random forests (run with the ranger
package; Wright and Ziegler, 2015), which do not. Rebar adjustment with LASSO worsened
the bias and variance of the PSM estimator, slightly for lower $R^2_{remnant}$ values and
considerably for higher $R^2_{remnant}$. On the other hand, rebar using random forests, which
achieved much lower $R^2_{remnant}$ values across the board, did little to no damage to the PSM
estimator. Apparently the matching routines were unable, in general, to perfectly identify the
treatable control subjects with Pr(Z = 1) = 0.5, so both the remnant and the matched set
contained subjects with outcomes drawn from both outcome models. While the structure of the
linear model allowed LASSO to maintain a close fit to the data, with unfortunate consequences
for rebar, random forest’s sensitivity to non-linearity led to worse model fit in the remnant,
and better performance in rebar.

Figure 2: Boxplots of treatment effect estimates from 500 simulation runs under the data
generating models in Section 5.4. The true treatment effect, indicated by a horizontal dotted
line, is zero. The methods are optimal pair matching (PSM), colored white, and
rebar-adjusted PSM (red), with $y_c$ predicted using LASSO or random forests. The six simulation
scenarios are arranged in a matrix, with rows for $\kappa = 0$ and 0.5, denoting levels of confounding
after the match, and columns for $\rho = 0$ and 0.05, from left to right, denoting correlation
structure between covariates. The $R^2_{remnant}$ values listed are averages of prediction $R^2$
for $\hat{y}_c(\cdot)$ estimated using cross-validation within the remnant. Column headers:
$R^2_{remnant}(\mathrm{LASSO}) = 0.28$ and $R^2_{remnant}(\mathrm{RF}) = 0.04$ (left);
$R^2_{remnant}(\mathrm{LASSO}) = 0.85$ and $R^2_{remnant}(\mathrm{RF}) = 0.45$ (right).

In summary, rebar with a linear adjustment model somewhat worsens MSE under data 
generating models combining nonlinear responses with limited propensity score overlap. The 
losses are mitigated by modes of rebar adjustment that are better aligned with prevailing 
associations of responses and covariates, and even without mitigation they are smaller than 
the improvement rebar offers under less pathological scenarios. 


6 Example Data Analysis: Evaluating Board Exam Systems


Board Exam Systems (BES) comprise a class of similar comprehensive educational reforms.
BES are packages that a school can adopt: sets of rigorous curricula for all academic courses,
corresponding sets of end-of-course exams, professional development and instructional
guidance for teachers, and systems of assistance for struggling students. Though uncommon in
the United States, BES are common around the world, and several research studies have
suggested that they improve student achievement (Bishop, 1997, 2000; Collier and Millimet,
2009).


Seven Arizona high schools began implementing BES programs in the 2012-2013 school
year: either the ACT Quality Core program or the Cambridge program. A pilot study sought
to evaluate the results after one year, in part by estimating the effects of the BES programs
on 10th-graders’ end-of-year standardized test scores, specifically the Arizona Instrument
to Measure Standards, or AIMS. Here we present a simplified version of the study’s estimate
of the effect of BES on school-average 10th-grade AIMS Reading scores. The analysis we
present here is intended to illustrate the rebar method, not to evaluate the effectiveness of
BES programs in Arizona.

For Arizona high schools in our sample, we had four years of pre-treatment data: that
is, data from four cohorts of students who preceded the adoption of BES, students set
to graduate in 2011-2014. For each cohort, we have the total enrollment; the percents
of students who are male, white, black, Hispanic, or of other race or ethnicity; the percents
receiving free or reduced-price lunches (FRL), receiving special education (SPED), and who
are English language learners (ELL); and average 8th Grade and 10th Grade AIMS scores on
writing, reading, math and science. We also have the percent of students in each cohort with
missing AIMS English and Math scores. From these data, we computed composite scores by
averaging the four components, and school “trends” for 10th grade math and reading scores:
ordinary-least-squares slope estimates from the school-level regressions of school mean AIMS
scores on a linear time variable. From the US National Center for Education Statistics Common
Core of Data (NCES, 2013), we have a categorization of each school into one of 10 categories
of urbanicity, ranging from urban to remote rural. All in all, there are 90 covariates, for a
total of 509 high schools.


6.1 A Propensity Score Match 

To estimate effects, then, we began with a propensity score match. Since there are only
$n_T = 7$ intervention schools, logistic regression with all 90 predictors was not feasible.
Instead, our propensity score model incorporated only a small subset of the covariates, those
that we believed would be most recognizable as potential confounders to the end audience of
the research. Specifically, we regressed schools’ BES status on the percent FRL, white, SPED,
Hispanic, and average and percent missing 8th and 10th grade AIMS scores for students in
the cohort immediately prior to BES implementation (those set to graduate in 2014) along
with estimated school trends in English and Math AIMS scores. Since this still gave more
predictors than there were observations in the treatment group, we expected that classical
logistic regression would fail to fit, so we instead used the Bayesian variant implemented in
the arm library for R (Gelman and Su, 2015; Gelman et al., 2008).


We constructed optimal propensity-score matches, using the R optmatch package (Hansen,
2007) to minimize paired differences in the estimated log odds of assignment to treatment.
Given the relatively large pool of available comparison schools, we disallowed the sharing of
controls, as in nearest-neighbor matching or full matching, while permitting multiple matches
per treatment school. Rather than leaving the maximum number of matched comparisons
per treatment unspecified, we restricted it to 4, a restriction that reduces the overall
information content of the matched sample (Cinar and Zubizarreta, 2016) only modestly relative
to matching without an upper limit on the number of matched controls per treatment. (Each
matched set m makes a contribution to effective sample size comparable to $h(n_{Tm}, n_{Cm})$
matched pairs, where $h(n_T, n_C) = [(n_T^{-1} + n_C^{-1})/2]^{-1}$ is the harmonic mean of $n_T$
and $n_C$ (Hansen, 2011; Cinar and Zubizarreta, 2016). For $n_T = 1$ and $n_C \ge 1$, this
contribution varies between 1 and 2, with $h(1, 4) = 1.6$.) If this left plausible matches for
some treatment-group schools on the table, these eligible but unused comparisons would enhance
the value of proximal validation, improving its ability to detect shortcomings of the
extrapolation that underlies rebar.
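The effective-sample-size arithmetic in the parenthetical can be checked directly. The function name `effective_pairs` is ours; the formula is simply the harmonic mean quoted in the text.

```python
def effective_pairs(n_t, n_c):
    """Harmonic mean of n_t and n_c: 2 / (1/n_t + 1/n_c)."""
    return 2.0 / (1.0 / n_t + 1.0 / n_c)


# One treated school matched to 1, 2, 4, or 10 controls.
for n_c in (1, 2, 4, 10):
    print(n_c, round(effective_pairs(1, n_c), 3))
```

With a single treated unit the contribution rises from 1 toward 2 as controls are added, and most of that gain is already realized at four controls, which is why the cap of 4 costs little.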

Table 1 displays covariate balance for the variables in the propensity score model
(standardized differences in covariate means and Z-scores) before and after matching.
Covariate balance was assessed with the xBalance routine in the RItools package for R
(Bowers et al., 2010). The xBalance routine also returns the results of omnibus balance
tests, for the full sample and the matched sample. They returned p-values of 0.04 and 0.71,
respectively. Evidently, the propensity score match controlled some covariate imbalance that
was in the full sample.

6.2 Rebar to Adjust the Match 


                                     Std. Diff.
                                Unmatched        Matched
% FRL                             1.06  **         0.08
% White                          -0.97  *          0.02
% Sp.Ed.                         -0.01            -0.19
% Hispanic                        1.34  ***        0.03
Urban                             0.24             0.13
avg. AIMS Writing (8th)           0.31            -0.10
avg. AIMS Reading (8th)           0.42            -0.18
avg. AIMS Math (8th)              0.79  *          0.06
avg. AIMS Reading (10th)         -0.55             0.14
avg. AIMS Math (10th)            -0.27             0.05
avg. AIMS Writing (10th)         -0.46            -0.01
trend: AIMS English (10th)       -0.37             0.11
trend: AIMS Math (10th)          -0.42             0.10
% AIMS Eng. Missing              -0.27            -0.17
% AIMS Math Missing              -0.20            -0.22
$\hat{y}_c(x)$                   -0.05             0.16

Table 1: Standardized differences testing balance on covariates from the propensity score
model and predictions $\hat{y}_c(x)$ in the entire sample of schools and for the matched sample,
conducted with the xBalance procedure.


6.2.1 Estimating $\hat{y}_c(\cdot)$



               LASSO    Random Forest    BayesLM    Ridge     Mean
RMSE           19.18        15.73         44.92     19.57    26.89
$R^2$           0.49         0.66         -1.79      0.47    -0.00
coefficient     0.00         1.00          0.00      0.00     0.00

Table 2: CV root-mean-squared error, $R^2$, and ensemble learner weight from the Super
Learner. The five models displayed are the LASSO, Random Forest, a linear model with
weak priors on the coefficients (“BayesLM”), Ridge regression, and a grand mean model.

After setting aside the treated schools and their untreated matches, there were 483 schools
in the remnant. We considered four different predictive modeling strategies to construct
$\hat{y}_c(\cdot)$: the LASSO (Tibshirani, 1996; implemented in R via Friedman et al., 2010), ridge
regression (Hoerl and Kennard, 1970; Venables and Ripley, 2002), linear regression with
weak priors for regularization (Gelman and Su, 2015), and random forests (Breiman, 2001),
along with grand-mean prediction, all combined via the Super Learner (Polley and van der
Laan, 2014). The Super Learner uses cross-validation to estimate the predictive accuracy
(measured in prediction mean-squared-error) of each of the modeling algorithms in a library.
Then, it constructs an “ensemble learner,” predicting new values as a weighted average of the
predictions from each of the algorithms, with the weights determined by the cross-validation
results. These results are displayed in Table 2. Apparently, the Random Forest dominates
the other algorithms, with a prediction $R^2$ of 0.66, to the extent that its ensemble weight is
1.
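The relationship between the RMSE and $R^2$ rows of Table 2 can be reproduced with a few lines of arithmetic. The sketch assumes that the reported $R^2$ is one minus the ratio of a model's cross-validated MSE to that of the grand-mean model; this definition is our assumption, although it does recover the tabled values to two decimals.

```python
# Cross-validated RMSE values as reported in Table 2.
rmse = {
    "LASSO": 19.18,
    "Random Forest": 15.73,
    "BayesLM": 44.92,
    "Ridge": 19.57,
    "Mean": 26.89,
}


def cv_r2(rmse_model, rmse_mean):
    """Assumed definition: 1 - MSE(model) / MSE(grand mean)."""
    return 1.0 - (rmse_model / rmse_mean) ** 2


r2 = {name: round(cv_r2(value, rmse["Mean"]), 2) for name, value in rmse.items()}
print(r2)
```

A negative value, as for BayesLM, means the algorithm predicted held-out schools worse than the grand mean did.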


6.2.2 Proximal Validation 


To gauge how a model trained on the remnant might perform on the matched sample, we
conducted proximal validation, described in Section 3.1. First, we constructed a second
match, identical to the first, but allowing each treated subject to match at most 10
control subjects. This resulted in 452 unmatchable “distal” schools as a training set, and
a set of “proximal” schools as a testing set. We then trained the Super Learner on the distal
schools, and computed its prediction accuracy against the proximal schools. Somewhat
surprisingly, the prediction models performed better when trained on the distal schools and
tested on the proximal schools than when both the training and testing sets were the entire
remnant, as in cross-validation. This may be a result of sampling error, or of the fact that
the distal set contains a number of outlier schools whose AIMS reading scores are particularly
hard to predict. These schools will increase the estimated MSE reported by any validation
method that includes them in its testing set. If there are no outlier schools in the proximal
set, proximal validation will not suffer from this difficulty.

Figure 3: Super Learner prediction accuracy: predictions ($\hat{y}_c(X)$) as a function of real
test scores. (A) gives the results of the Super Learner fit to, and tested against, the entire
remnant (horizontal axis: Test Scores (Remnant)). (B) shows the proximal validation results:
the performance of the Super Learner fit in the distal portion of the remnant and tested
against the proximal portion (horizontal axis: Test Scores (Proximal Remnant)). The figures
also contain the y = x line for comparison.
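The mechanics of proximal validation can be sketched generically: fit on the distal part of the remnant, score on the proximal part. In this illustration a grand-mean predictor stands in for the Super Learner, and all function names and data are hypothetical.

```python
def mse(y, yhat):
    """Mean squared prediction error."""
    return sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y)


def proximal_validation(x, y, distal, proximal, fit):
    """Train on distal remnant indices, evaluate on proximal remnant indices.

    fit(x_train, y_train) must return a function mapping one x to a prediction.
    """
    predict = fit([x[i] for i in distal], [y[i] for i in distal])
    return mse([y[i] for i in proximal], [predict(x[i]) for i in proximal])


def fit_mean(x_train, y_train):
    """Stand-in learner: always predicts the training mean."""
    m = sum(y_train) / len(y_train)
    return lambda xi: m


x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 2.0, 2.0]
print(proximal_validation(x, y, distal=[0, 1], proximal=[2, 3], fit=fit_mean))
```

Because the test set consists of controls that resemble the treated units, the resulting error speaks to extrapolation toward the matched sample more directly than ordinary cross-validation within the whole remnant.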

As an additional check of the identification assumption for match m, we tested balance
on $\hat{y}_c(X)$, in the same way as for other covariates: we tested whether
$\sum_i \hat{y}_c(x_i) Z_i / n_T = \sum_i \hat{y}_c(x_i)(1 - Z_i) / n_C$.
The resulting p-value from the xBalance routine was 0.46; the balance test on $\hat{y}_c(X)$ does
not falsify the assumption.


6.2.3 Estimating Treatment Effects 



          Estimate     SE      p-value     95% CI
PSM         5.91      4.98      0.22      (-5.9, 18.3)
rebar       1.3       3.64      0.32      (-4.1, 9.8)

Table 3: The average treatment effect on the treated $\tau_{ett}$, along with regression standard
errors and permutational p-values and 95% confidence intervals, estimated with conventional
propensity-score matching, as described in Section 6.1, and with rebar.


Finally, we calculated both $\hat{\tau}_m$, the matching estimator using Y, and $\hat{\tau}_{rebar}$,
the rebar matching estimator, along with HC3 standard errors, shown in Table 3. To estimate
p-values, we conducted permutation tests, permuting treatment indicators within matched
sets and re-computing the estimates. Ninety-five percent confidence intervals were estimated
by inverting the permutation test, as in, e.g., Rosenbaum (2002a). Neither the conventional
method nor rebar detected a statistically significant effect. However, the rebar estimate
resulted in a confidence interval with less than half the width of the conventional interval.
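A within-set permutation test of the kind described can be sketched as follows. The sketch assumes one treated unit per matched set, equal weights across sets, and exact enumeration of all within-set reassignments; these are simplifying assumptions of ours, and the function names are hypothetical. The same code applies whether the inputs are raw outcomes Y or rebar residuals e.

```python
from itertools import product


def set_diff(y_set, t):
    """Outcome of the unit labelled treated minus the mean of the others."""
    others = [v for k, v in enumerate(y_set) if k != t]
    return y_set[t] - sum(others) / len(others)


def permutation_p(sets, treated_pos):
    """Two-sided permutation p-value, permuting treatment within matched sets.

    sets:        list of outcome lists, one per matched set
    treated_pos: position of the actually treated unit within each set
    """
    def stat(pos):
        return sum(set_diff(s, t) for s, t in zip(sets, pos)) / len(sets)

    observed = stat(treated_pos)
    all_pos = list(product(*[range(len(s)) for s in sets]))
    # Small tolerance guards against floating-point ties with the observed value.
    extreme = sum(abs(stat(pos)) >= abs(observed) - 1e-12 for pos in all_pos)
    return observed, extreme / len(all_pos)


sets = [[3.0, 1.0], [4.0, 2.0]]
print(permutation_p(sets, treated_pos=(0, 0)))
```

A confidence interval can then be obtained by test inversion: subtract a hypothesized effect from the treated units' outcomes and retain the hypothesized values that the test does not reject.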


7 Conclusion 

In structural engineering, “rebar” abbreviates “reinforcement bar,” a metal beam that is
embedded in concrete. Concrete is resistant to compression, whereas rebar is resistant to
tension; the combination of the two materials, rebar and concrete, is robust to a variety
of threats. Similarly, the rebar method of this paper complements the use of matching for
confounder control. Whereas matching typically focuses primarily on possible confounders’
associations with the treatment variable, and typically leaves some subjects unmatched,
rebar addresses bias by using the remnant from matching, the unmatched controls, to
model possible confounders’ associations with outcomes. The predictions that result,
$\hat{y}_c(x)$, extract information about subjects’ control potential outcomes from the covariates
X. The process of residualizing, that is, subtracting predictions $\hat{y}_c(x)$ from outcomes Y,
can neutralize confounding from variables that the match failed to balance.

Residualizing using the remnant confers these benefits without compromising the
statistical rationale for matching. Indeed, matching supplemented with rebar inherits a number
of central attractions of the matching estimator. For instance, researchers with any level of
statistical training can assess the success of the matching procedure by examining matched
units’ comparability on substantively meaningful baseline variables. Although rebar typically
makes use of data from outside the range of common support (the set of subjects i for which
$0 < \Pr(Z_i = 1 \mid x_i) < 1$), its final estimate $\hat{\tau}_{rebar}$ compares only matched subjects,
observing any common support restrictions that the matching procedure observed. The
procedure is compatible with postponing analysis involving outcomes until the process of
matching is complete, as recommended by Rubin (2008). If matching succeeds in recreating a
latent experiment, where subjects matched to each other were assigned to treatment randomly,
then $\hat{\tau}_{rebar}$, like $\hat{\tau}_m$, is unbiased.

Generating predictions $\hat{y}_c(x)$ involves extrapolating from the remnant to the matched
sample; in some circumstances, the method could worsen the quality of matched inferences.
This risk is mitigated with the use of cross-validation, to limit overfitting of the prediction
model, followed by proximal validation, which additionally detects biases specific to
extrapolation from lower- into higher-propensity score regions of x-space. Both forms of
validation are assisted by the presence of a sizable matching remnant, including at least
controls that would have been suitable matches for some treatment group members. While
compatible with any method of matching that leaves a positive fraction of the control
reservoir unmatched, rebar is particularly attractive in observational studies with many more
untreated than treated subjects.

We have focused on the capacity of rebar to reduce bias, but the method may have other
benefits as well. For instance, the confidence interval from a rebar analysis of the BES data
had less than half the width of the confidence interval from the corresponding matching
analysis. Indeed, confidence interval widths and standard errors generally increase
with the variance of the outcome. Unless the rebar extrapolation is so unstable as
to worsen MSE (that is, unless, within the matched sample, the mean-square difference between
rebar’s out-of-sample prediction and Y exceeds the variance of Y), confidence intervals based
on e are bound to be tighter than those based on Y alone. In addition, studies with more stable
outcomes tend to have greater design sensitivity (Rosenbaum, 2010; Zubizarreta et al., 2013).
Barring instability, the rebar analysis will be less sensitive to confounding from unmeasured
or unmodeled variables. The relative stability of e and Y is reflected in the prediction $R^2$
of the rebar $\hat{y}_c(\cdot)$ when applied to the matched set, for which cross-validation and
proximal validation can suggest a plausible range.


8 Appendix: Proofs of Propositions


8.1 The Bias of $\hat{\tau}_m$

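Lemma 1. In a matching design where the target of estimation is $\tau_{ett}$, the bias of the
matching estimator $\hat{\tau}_m$ is

$$E[\hat{\tau}_m(Y)] - \tau_{ett} = \sum_m w_m\, y_{cm}^\top \left( \frac{P_m}{n_{Tm}} - \frac{1 - P_m}{n_{Cm}} \right),$$

where $y_{cm}$ is the vector of $y_c$ values for all subjects for whom $m_i = m$, $\{y_{ci}\}_{m_i = m}$,
and $P_m$ is a vector of probabilities of treatment assignment for subjects in m, given $n_{Tm}$
and $n_{Cm}$: $P_i = \Pr(Z_i = 1 \mid n_{Tm}, n_{Cm})$.

Proof. All of the following expectations are taken conditional on $n_{C1}, \ldots, n_{CM}$ and
$n_{T1}, \ldots, n_{TM}$. Write $t_m(Y, Z)$ for the treated-minus-control mean difference within
matched set m, and $Y_m$, $Z_m$ for the outcomes and treatment indicators of its members.

$$E[\hat{\tau}_m(Y)] = E \sum_m w_m\, t_m(Y, Z) = \sum_m w_m\, E\, t_m(Y, Z).$$

Next, for a particular match m,

$$E\, t_m(Y, Z) = E\left[ \frac{1}{n_{Tm}} Y_m^\top Z_m - \frac{1}{n_{Cm}} Y_m^\top (1 - Z_m) \right]$$

$$= E\left[ \frac{1}{n_{Tm}} y_{cm}^\top Z_m - \frac{1}{n_{Cm}} y_{cm}^\top (1 - Z_m) \right] + \frac{1}{n_{Tm}} E\left[ (Y_m - y_{cm})^\top Z_m \right]$$

$$= y_{cm}^\top \left( \frac{P_m}{n_{Tm}} - \frac{1 - P_m}{n_{Cm}} \right) + \frac{1}{n_{Tm}} E\left[ (Y_m - y_{cm})^\top Z_m \right].$$

Then note that $\sum_m w_m \frac{1}{n_{Tm}} E\left[ (Y_m - y_{cm})^\top Z_m \right] = \tau_{ett}$.

□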


8.2 Proof of Proposition 

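Proof. As in the proposition, the squared bias of $\hat{\tau}_{rebar}$ is

$$\mathrm{bias}^2(\hat{\tau}_{rebar}) = \left[ \sum_m w_m (y_{cm} - \hat{y}_{cm})^\top \left( \frac{P_m}{n_{Tm}} - \frac{1 - P_m}{n_{Cm}} \right) \right]^2.$$

Let $y_c$ and $\hat{y}_c$ be length-n vectors, concatenations of $y_{cm}$ and $\hat{y}_{cm}$. For
$i = 1, \ldots, n$ let $Q_i = w_{m_i} \left( \frac{P_i}{n_{Tm_i}} - \frac{1 - P_i}{n_{Cm_i}} \right)$, and let Q be
the concatenation of $\{Q_i\}$, a length-n vector. Since $0 < P_i < 1$,
$|Q_i| < \max\left( \frac{1}{n_{Tm_i}}, \frac{1}{n_{Cm_i}} \right) w_{m_i}$. Then

$$\mathrm{bias}^2(\hat{\tau}_{rebar}) = \left[ (y_c - \hat{y}_c)^\top Q \right]^2$$

$$\le n\, \frac{\| y_c - \hat{y}_c \|^2}{n}\, \| Q \|^2 \quad \text{by Cauchy-Schwarz}$$

$$\le n\, \frac{\| y_c - \hat{y}_c \|^2}{n} \sum_i \max\left( \frac{1}{n_{Tm_i}}, \frac{1}{n_{Cm_i}} \right)^2 w_{m_i}^2$$

$$= \frac{\| y_c - \hat{y}_c \|^2}{n}\; n \sum_m w_m^2\, (n_{Cm} + n_{Tm}) \max\left( \frac{1}{n_{Tm}}, \frac{1}{n_{Cm}} \right)^2.$$

□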


8.3 Proof of Proposition 

Proof. The proof follows the form of the preceding proof but exploits the fact that
$Q_i \le (\Gamma^{1/2} - 1)/(\Gamma^{1/2} + 1)/n_T$. This follows from two facts: first, in a matched
pair design, if i is matched to j and $i \ne j$, $P_i = 1 - P_j$, so the bound in $\Gamma$ can be
re-written as $1/\Gamma \le P_i^2/(1 - P_i)^2 \le \Gamma$. Secondly, in a matched pair design, the term
$P_i/n_{Tm_i} - P_j/n_{Cm_j}$ can be written as $2P_i - 1$. The result follows. □
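The algebra behind the bound on $Q_i$ can be spelled out. The following is a sketch under two assumptions that are our reading rather than statements from the text: that the sensitivity parameter, written $\Gamma$ here, satisfies $\Gamma \ge 1$, and that in a pair match each pair carries weight $w_{m_i} = 1/n_T$.

```latex
\begin{align*}
\frac{P_i^2}{(1-P_i)^2} \le \Gamma
  &\;\Longrightarrow\; \frac{P_i}{1-P_i} \le \Gamma^{1/2}
   \;\Longrightarrow\; P_i \le \frac{\Gamma^{1/2}}{1+\Gamma^{1/2}}, \\
2P_i - 1 &\le \frac{2\,\Gamma^{1/2}}{1+\Gamma^{1/2}} - 1
   = \frac{\Gamma^{1/2}-1}{\Gamma^{1/2}+1}, \\
Q_i = w_{m_i}\,(2P_i - 1) &\le \frac{1}{n_T}\cdot\frac{\Gamma^{1/2}-1}{\Gamma^{1/2}+1}.
\end{align*}
```

The lower bound $1/\Gamma \le P_i^2/(1-P_i)^2$ gives the mirror-image inequality, so the same quantity bounds $|Q_i|$. When $\Gamma = 1$ the bound is zero, which corresponds to random assignment within pairs.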